Abstract. As increasing amounts of sensitive personal information are
aggregated into data repositories, it has become important to develop
mechanisms for processing the data without revealing information about
individual data instances. The differential privacy model provides a
framework for the development and theoretical analysis of such mechanisms.

In this paper, we propose an algorithm for learning a discriminatively 
trained multi-class Gaussian classifier that satisfies differential privacy 
using a large margin loss function with a perturbed regularization term. 
We present a theoretical upper bound on the excess risk of the classifier 
introduced by the perturbation.

1 Introduction



In recent years, vast amounts of personal data have been aggregated in the form
of medical and financial records, social networks, and government census data.
As these often contain sensitive information, a database curator interested in
releasing a function such as a statistic evaluated over the data is faced with
the prospect that it may lead to a breach of privacy of the individuals who
contributed to the database. It is therefore important to develop techniques
for retrieving desired information from a dataset without revealing any
information about individual data instances. Differential privacy [1] is a
theoretical model proposed to address this issue. A query mechanism evaluated
over a dataset is said to satisfy differential privacy if it is likely to
produce the same output on a dataset differing by at most one element. This
implies that an adversary having complete knowledge of all data instances but
one, along with a priori information about the remaining instance, is not
likely to be able to infer any more information about the remaining instance
by observing the output of the mechanism.

One of the most common applications of large data sets such as the ones
mentioned above is training classifiers that can be used to categorize new
data. If the training data contains private data instances, an adversary should
not be able to learn anything about the individual training data instances by
analyzing the output of the classifier. Recently, mechanisms for learning
differentially private classifiers have been proposed for logistic regression
[5]. In this method, the objective function which is minimized by the
classification algorithm is modified by adding a linear perturbation term.
Compared to the original classifier, there is an additional error introduced by
the perturbation term in the differentially private classifier. It is important
to have an upper bound on this error, as it is the cost of preserving privacy.

The work mentioned above is largely restricted to binary classification, while
multi-class classifiers are more useful in many practical situations. In this
paper, we propose an algorithm for learning multi-class Gaussian classifiers
which satisfies differential privacy. Gaussian classifiers, which model the
distribution of each class as being generated from a Gaussian distribution or a
mixture of Gaussian distributions [3], are commonly used as multi-class
classifiers. We use the large margin discriminative algorithm for training the
classifier introduced by Sha and Saul [4]. To ensure that the learned
multi-class classifier preserves differential privacy, we modify the objective
function by introducing a perturbed regularization term.

2 Differential Privacy 

In recent years, the differential privacy model proposed by Dwork, et al. [1]
has emerged as a robust standard for data privacy. It originated from the
statistical database model, where the dataset $D$ is a collection of elements
and a randomized query mechanism $M$ produces a response when performed on a
given dataset. Two datasets $D$ and $D'$ differing by at most one element are
said to be adjacent. There are two proposed definitions of adjacent datasets:
one based on symmetric difference, with $D'$ containing one entry less than
$D$, and one based on substitution, with one entry of $D'$ differing in value
from $D$. We use the substitution definition of adjacency previously used by
[5, 2], where one entry of the dataset $D = \{x_1, \ldots, x_{n-1}, x_n\}$ is
modified to result in an adjacent dataset
$D' = \{x_1, \ldots, x_{n-1}, x'_n\}$. The query mechanism $M$ is said to
satisfy differential privacy if the probability of $M$ resulting in a solution
$S$ when performed on a dataset $D$ is very close to the probability of $M$
resulting in the same solution $S$ when executed on an adjacent dataset $D'$.
Assuming the query mechanism to be a function $M : D \mapsto \mathrm{range}(M)$
with a probability function $P$ defined over the space of $M$, differential
privacy is formally defined as follows.

Definition 1. A randomized function $M$ satisfies $\epsilon$-differential
privacy if for all adjacent datasets $D$ and $D'$ and for any
$S \in \mathrm{range}(M)$,

$$\left| \log \frac{P(M(D) = S)}{P(M(D') = S)} \right| \le \epsilon.$$

The value of the $\epsilon$ parameter, which is referred to as leakage,
determines the degree of privacy. As there is always a trade-off between
privacy and utility, the choice of $\epsilon$ is motivated by the requirements
of the application.
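
As a small illustration of the definition, the sketch below checks the log
probability ratio numerically for a count query released with Laplace noise.
The mechanism, the counts 42 and 43, and the helper names are assumptions made
for illustration only; they are not part of the proposed algorithm.

```python
import math

def laplace_log_density(z, center, scale):
    """Log density of a Laplace distribution evaluated at z."""
    return -math.log(2.0 * scale) - abs(z - center) / scale

def log_ratio(output, count_d, count_d_prime, epsilon, sensitivity=1.0):
    """log of P[M(D) = output] / P[M(D') = output] for a Laplace-noised count."""
    scale = sensitivity / epsilon
    return (laplace_log_density(output, count_d, scale)
            - laplace_log_density(output, count_d_prime, scale))

epsilon = 0.5
# Hypothetical adjacent datasets: their counts differ by exactly one.
worst = max(abs(log_ratio(s / 10.0, 42, 43, epsilon)) for s in range(1000))
print(worst <= epsilon + 1e-12)
```

The largest absolute log ratio over the scanned outputs stays within
$\epsilon$, which is the condition Definition 1 asks for.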

In a machine learning setting, the query mechanism can be thought of as an
algorithm learning the classification, regression or density estimation rule
which is evaluated over the training dataset. The output of an algorithm
satisfying differential privacy is likely to be the same when the value of any
single dataset instance is modified. Therefore, no additional information about
any individual training data instance can be obtained with certainty by
observing the output of the learning algorithm, beyond what is already known to
an adversary. Differential privacy is a strong definition of privacy: it
provides an ad omnia guarantee, as opposed to most other models, which provide
ad hoc guarantees against a specific set of attacks and adversarial behaviors.



2.1 Related Work 

The earlier work on differential privacy was related to functional
approximations for simple data mining tasks and data release mechanisms
[6, 7, 8, 9]. Although many of these works have connections to machine learning
problems, the design and analysis of machine learning algorithms satisfying
differential privacy has been actively studied only more recently.
Kasiviswanathan, et al. [5] present a framework for converting a general
agnostic PAC learning algorithm to an algorithm that satisfies privacy
constraints. Chaudhuri and Monteleoni [5] use the exponential mechanism [10] to
create a differentially private logistic regression classifier by adding
Laplace noise to the estimated parameters. They propose another differentially
private formulation which involves modifying the objective function of the
logistic regression classifier by adding a linear term scaled by Laplace noise.
The second formulation is advantageous because it is independent of the
classifier sensitivity, which is difficult to compute in general, and it can be
shown that using a perturbed objective function introduces a lower error as
compared to the exponential mechanism.

However, the above mentioned differentially private classification algorithms
only address the problem of binary classification. Although it is possible to
extend binary classification algorithms to multi-class using techniques like
one-vs-all, it is much more expensive to do so as compared to a naturally
multi-class classification algorithm. Jagannathan, et al. [11] present a
differentially private random decision tree learning algorithm which can be
applied to multi-class classification. Their approach involves perturbing leaf
nodes using the sensitivity method, and they do not provide a theoretical
analysis of the excess risk of the perturbed classifier. In this paper, we
propose a modification to the naturally multi-class large margin Gaussian
classification algorithm [4, 12].



3 Large Margin Gaussian Classifiers 

We investigate the large margin multi-class classification algorithm introduced
by Sha and Saul [4]. The training dataset $(\mathbf{x}, \mathbf{y})$ contains
$n$ i.i.d. $d$-dimensional training data instances
$\mathbf{x}_i \in \mathbb{R}^d$, each with a label $y_i \in \{1, \ldots, C\}$.
We consider the setting where each class is modeled as a single Gaussian
ellipsoid. Each class ellipsoid is parametrized by the centroid
$\boldsymbol{\mu}_c \in \mathbb{R}^d$, the inverse covariance matrix
$\boldsymbol{\Psi}_c \in \mathbb{R}^{d \times d}$, and a scalar offset
$\theta_c \ge 0$. The decision rule is to assign an instance $\mathbf{x}_i$ to
the class having the smallest Mahalanobis distance [13], with the scalar
offset, from $\mathbf{x}_i$ to the centroid of that class:

$$y_i = \arg\min_c \; (\mathbf{x}_i - \boldsymbol{\mu}_c)^T \boldsymbol{\Psi}_c (\mathbf{x}_i - \boldsymbol{\mu}_c) + \theta_c. \tag{1}$$

(Notation: vectors and matrices are denoted by boldface.)
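
The decision rule in Equation (1) can be sketched in a few lines. The two
classes, their parameters, and the function names below are hypothetical and
chosen only to make the example self-contained.

```python
def mahalanobis_score(x, mu, psi, theta):
    """(x - mu)^T Psi (x - mu) + theta for a single class."""
    diff = [xi - mi for xi, mi in zip(x, mu)]
    quad = sum(diff[i] * psi[i][j] * diff[j]
               for i in range(len(diff)) for j in range(len(diff)))
    return quad + theta

def classify(x, classes):
    """Index of the class with the smallest offset Mahalanobis distance."""
    scores = [mahalanobis_score(x, mu, psi, theta) for mu, psi, theta in classes]
    return min(range(len(scores)), key=scores.__getitem__)

identity = [[1.0, 0.0], [0.0, 1.0]]
# Hypothetical classes given as (centroid, inverse covariance, offset).
classes = [([0.0, 0.0], identity, 0.0), ([3.0, 3.0], identity, 0.0)]
print(classify([0.5, 0.2], classes), classify([2.5, 3.1], classes))
```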

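To simplify the notation, we expand
$(\mathbf{x}_i - \boldsymbol{\mu}_c)^T \boldsymbol{\Psi}_c (\mathbf{x}_i - \boldsymbol{\mu}_c)$
and collect the parameters for each class as the following
$(d+1) \times (d+1)$ positive semidefinite matrix

$$\boldsymbol{\Phi}_c = \begin{bmatrix} \boldsymbol{\Psi}_c & -\boldsymbol{\Psi}_c \boldsymbol{\mu}_c \\ -\boldsymbol{\mu}_c^T \boldsymbol{\Psi}_c & \boldsymbol{\mu}_c^T \boldsymbol{\Psi}_c \boldsymbol{\mu}_c + \theta_c \end{bmatrix} \tag{2}$$

and also append a unit element to each $d$-dimensional vector $\mathbf{x}_i$.
The decision rule for a data instance $\mathbf{x}_i$ simplifies to

$$y_i = \arg\min_c \; \mathbf{x}_i^T \boldsymbol{\Phi}_c \mathbf{x}_i. \tag{3}$$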
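
The following sketch builds the block matrix obtained by expanding the
quadratic form and checks that the augmented quadratic form agrees with the
offset Mahalanobis distance of Equation (1). The parameter values and helper
names are assumptions for illustration, and the inverse covariance is assumed
symmetric.

```python
def build_phi(mu, psi, theta):
    """Collect (mu, Psi, theta) into one (d+1) x (d+1) matrix; Psi symmetric."""
    d = len(mu)
    psi_mu = [sum(psi[i][j] * mu[j] for j in range(d)) for i in range(d)]
    corner = sum(mu[i] * psi_mu[i] for i in range(d)) + theta
    phi = [psi[i][:] + [-psi_mu[i]] for i in range(d)]
    phi.append([-v for v in psi_mu] + [corner])
    return phi

def quad_form(x, mat):
    """x^T M x for a square matrix M stored as nested lists."""
    return sum(x[i] * mat[i][j] * x[j]
               for i in range(len(x)) for j in range(len(x)))

mu, psi, theta = [0.5, -1.0], [[2.0, 0.5], [0.5, 1.0]], 0.3
x = [1.0, 2.0]
diff = [xi - mi for xi, mi in zip(x, mu)]
direct = quad_form(diff, psi) + theta          # Equation (1) form
augmented = quad_form(x + [1.0], build_phi(mu, psi, theta))  # Equation (3) form
print(abs(direct - augmented) < 1e-9)
```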

The discriminative training procedure involves estimating a set of positive
semidefinite matrices $\{\boldsymbol{\Phi}_1, \ldots, \boldsymbol{\Phi}_C\}$
from the training data
$\{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\}$ which optimize the
performance on the decision rule mentioned above. We apply the large margin
intuition that the optimal classifier must maximize the distance of training
data instances from the decision boundaries. This leads to the classification
algorithm being robust to outliers with provably strong generalization
guarantees. Formally, we require that for each training data instance
$\mathbf{x}_i$ with label $y_i$, the distance from $\mathbf{x}_i$ to the
centroid of class $y_i$ is smaller than its distance to the centroid of every
other class by at least one:

$$\forall c \ne y_i : \quad \mathbf{x}_i^T \boldsymbol{\Phi}_c \mathbf{x}_i \ge 1 + \mathbf{x}_i^T \boldsymbol{\Phi}_{y_i} \mathbf{x}_i.$$

Analogous to support vector machines, the training algorithm is an optimization
problem minimizing the hinge loss, denoted by $[f]_+ = \max(0, f)$, with a
linear penalty for incorrect classification. We use the sum of traces of the
inverse covariance matrices of all classes as a regularization term. The
regularization requires that if we can learn a classifier which labels every
training data instance correctly, we choose the one with the lowest inverse
covariance, or highest covariance, for each class ellipsoid, as this prevents
the classifier from over-fitting. The parameter $\lambda$ controls the
trade-off between the loss function and the regularization.

$$J(\boldsymbol{\Phi}, \mathbf{x}, \mathbf{y}) = \sum_i \sum_{c \ne y_i} \left[ 1 + \mathbf{x}_i^T (\boldsymbol{\Phi}_{y_i} - \boldsymbol{\Phi}_c) \mathbf{x}_i \right]_+ + \lambda \sum_c \mathrm{trace}(\boldsymbol{\Psi}_c). \tag{4}$$
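
A minimal sketch of evaluating the objective in Equation (4) is given below.
It assumes each data instance already carries the appended unit element and
that the inverse covariance occupies the upper-left block of each class matrix;
the one-dimensional data and the function names are hypothetical.

```python
def quad_form(x, mat):
    """x^T M x for a square matrix M stored as nested lists."""
    return sum(x[i] * mat[i][j] * x[j]
               for i in range(len(x)) for j in range(len(x)))

def hinge_objective(phis, xs, ys, lam):
    """Sum of hinge terms over all wrong classes plus the trace regularizer."""
    loss = 0.0
    for x, y in zip(xs, ys):
        own = quad_form(x, phis[y])
        for c, phi in enumerate(phis):
            if c != y:
                loss += max(0.0, 1.0 + own - quad_form(x, phi))
    d = len(phis[0]) - 1
    # trace(Psi_c) is the sum of the first d diagonal entries of Phi_c.
    reg = sum(phi[i][i] for phi in phis for i in range(d))
    return loss + lam * reg

# Two hypothetical 1-D classes with unit inverse covariance, centred at 0 and 3.
phis = [[[1.0, 0.0], [0.0, 0.0]], [[1.0, -3.0], [-3.0, 9.0]]]
xs = [[0.0, 1.0], [3.0, 1.0], [1.5, 1.0]]
ys = [0, 1, 0]
print(hinge_objective(phis, xs, ys, lam=0.1))
```

Only the third instance, which lies on the decision boundary, violates the unit
margin, so it contributes a hinge term of one on top of the regularizer.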

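The inverse covariance matrix $\boldsymbol{\Psi}_c$ is contained in the
upper-left $d \times d$ block of the matrix $\boldsymbol{\Phi}_c$. We replace
it with $\mathbf{I}_\Phi \boldsymbol{\Phi}_c \mathbf{I}_\Phi$, where
$\mathbf{I}_\Phi$ is the truncated $(d+1) \times (d+1)$ identity matrix with
the last diagonal element $\mathbf{I}_{\Phi, (d+1)(d+1)}$ set to zero. The
optimization problem becomes

$$J(\boldsymbol{\Phi}, \mathbf{x}, \mathbf{y}) = \sum_i \sum_{c \ne y_i} \left[ 1 + \mathbf{x}_i^T (\boldsymbol{\Phi}_{y_i} - \boldsymbol{\Phi}_c) \mathbf{x}_i \right]_+ + \lambda \sum_c \mathrm{trace}(\mathbf{I}_\Phi \boldsymbol{\Phi}_c \mathbf{I}_\Phi) = L(\boldsymbol{\Phi}, \mathbf{x}, \mathbf{y}) + N(\boldsymbol{\Phi}). \tag{5}$$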



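The hinge loss, being non-differentiable, is not very convenient for our
analysis; we replace it with a surrogate loss function called the Huber loss
$\ell_h$ [5], which has similar characteristics to the hinge loss for small
values of $h$.

$$\ell_h(\boldsymbol{\Phi}_c; \mathbf{x}_i, y_i) = \begin{cases} 0 & \text{if } \mathbf{x}_i^T (\boldsymbol{\Phi}_c - \boldsymbol{\Phi}_{y_i}) \mathbf{x}_i > h, \\ \frac{1}{4h} \left[ h - \mathbf{x}_i^T (\boldsymbol{\Phi}_{y_i} - \boldsymbol{\Phi}_c) \mathbf{x}_i \right]^2 & \text{if } |\mathbf{x}_i^T (\boldsymbol{\Phi}_c - \boldsymbol{\Phi}_{y_i}) \mathbf{x}_i| \le h, \\ -\mathbf{x}_i^T (\boldsymbol{\Phi}_{y_i} - \boldsymbol{\Phi}_c) \mathbf{x}_i & \text{if } \mathbf{x}_i^T (\boldsymbol{\Phi}_c - \boldsymbol{\Phi}_{y_i}) \mathbf{x}_i < -h. \end{cases} \tag{6}$$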
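
To illustrate how a Huber-type surrogate approaches the hinge loss for small
$h$, the sketch below uses the common scalar form of the Huber-smoothed hinge
on a margin variable `z`. This scalar form and the variable `z` are
assumptions for illustration; the block is not a transcription of Equation (6).

```python
def hinge(z):
    """Standard hinge loss on a scalar margin z."""
    return max(0.0, 1.0 - z)

def huber_hinge(z, h):
    """Huber-smoothed hinge: zero, quadratic, then linear as z decreases."""
    if z > 1.0 + h:
        return 0.0
    if abs(1.0 - z) <= h:
        return (1.0 + h - z) ** 2 / (4.0 * h)
    return 1.0 - z

for h in (0.5, 0.1, 0.01):
    # Largest gap to the hinge loss on a grid of margins in [-1, 3].
    gap = max(abs(huber_hinge(k / 100.0, h) - hinge(k / 100.0))
              for k in range(-100, 301))
    print(h, gap)
```

The largest gap to the hinge loss shrinks as $h$ decreases, while the smoothed
loss stays differentiable at the kink.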



The objective function is a convex function of the positive semidefinite
matrices $\boldsymbol{\Phi}_c$. The optimization can be formulated as a
semidefinite programming problem [15] and solved efficiently using interior
point methods.

The large margin classification framework can be easily extended to modeling
each class with a mixture of Gaussians. Similar to support vector machines,
when training with non-separable data, we can introduce slack parameters to
permit margin violations. These extensions do not change the basic
characteristics of the learning algorithm. The optimization problem remains a
convex semidefinite program with piecewise linear terms and is equally
tractable. For simplicity, we restrict our discussion to single Gaussians and
hard margins in this paper. As we shall see, it is easy to extend our proposed
modifications to these cases.



4 Differentially Private Large Margin Gaussian Classifiers 

We modify the large margin Gaussian classification formulation to satisfy
differential privacy by introducing a perturbation term in the objective
function. As we will see in Section 5.1, this modification leads to a
classifier that preserves differential privacy.

We generate the $(d+1) \times (d+1)$ perturbation matrix $\mathbf{b}$ with
density

$$P(\mathbf{b}) \propto \exp\left( -\frac{\epsilon}{2} \|\mathbf{b}\| \right), \tag{7}$$

where $\|\cdot\|$ is the Frobenius norm (element-wise $\ell_2$ norm) and
$\epsilon$ is the privacy parameter. One method of generating such a matrix
$\mathbf{b}$ is to sample the norm $\|\mathbf{b}\|$ from
$\Gamma\left( (d+1)^2, \frac{2}{\epsilon} \right)$ and the direction of
$\mathbf{b}$ uniformly at random.
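
The sampling recipe just described can be sketched as follows. The sketch
assumes a Gamma distribution with shape $(d+1)^2$ and scale $2/\epsilon$ for
the norm, and obtains a uniformly random direction by normalizing a vector of
independent standard Gaussians; the function name and seed are hypothetical.

```python
import math
import random

def sample_perturbation(d, epsilon, rng):
    """Draw a (d+1) x (d+1) matrix with density proportional to exp(-eps/2 ||b||)."""
    size = d + 1
    n_entries = size * size
    # Norm: Gamma with shape (d+1)^2 and scale 2/epsilon (assumed parametrization).
    norm = rng.gammavariate(n_entries, 2.0 / epsilon)
    # Direction: a normalized Gaussian vector is uniform on the unit sphere.
    gauss = [rng.gauss(0.0, 1.0) for _ in range(n_entries)]
    length = math.sqrt(sum(g * g for g in gauss))
    flat = [norm * g / length for g in gauss]
    return [flat[r * size:(r + 1) * size] for r in range(size)]

b = sample_perturbation(d=2, epsilon=1.0, rng=random.Random(0))
print(len(b), len(b[0]))
```

Smaller values of $\epsilon$ give a larger scale and therefore, on average, a
larger perturbation norm.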

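Our proposed learning algorithm minimizes the following objective function
$J_p(\boldsymbol{\Phi}, \mathbf{x}, \mathbf{y})$, where the subscript $p$
denotes privacy.

$$J_p(\boldsymbol{\Phi}, \mathbf{x}, \mathbf{y}) = L(\boldsymbol{\Phi}, \mathbf{x}, \mathbf{y}) + \lambda \sum_c \mathrm{trace}(\mathbf{I}_\Phi \boldsymbol{\Phi}_c \mathbf{I}_\Phi) + \sum_c \sum_{ij} b_{ij} \Phi_{cij} = J(\boldsymbol{\Phi}, \mathbf{x}, \mathbf{y}) + \sum_c \sum_{ij} b_{ij} \Phi_{cij}. \tag{8}$$

As the dimensionality of the perturbation matrix $\mathbf{b}$ is the same as
that of the classifier parameters $\boldsymbol{\Phi}_c$, the parameter space of
$\boldsymbol{\Phi}$ does not change after perturbation. In other words, given
two datasets $(\mathbf{x}, \mathbf{y})$ and $(\mathbf{x}', \mathbf{y}')$, if
$\boldsymbol{\Phi}^p$ minimizes
$J_p(\boldsymbol{\Phi}, \mathbf{x}, \mathbf{y})$, it is always possible to have
$\boldsymbol{\Phi}^p$ minimize
$J_p(\boldsymbol{\Phi}, \mathbf{x}', \mathbf{y}')$. This is a necessary
condition for the classifier $\boldsymbol{\Phi}^p$ satisfying differential
privacy.

Furthermore, as the perturbation term is convex and positive semidefinite, the
perturbed objective function $J_p(\boldsymbol{\Phi}, \mathbf{x}, \mathbf{y})$
has the same properties as the unperturbed objective function
$J(\boldsymbol{\Phi}, \mathbf{x}, \mathbf{y})$. Also, the perturbation does not
introduce any additional computational cost as compared to the original
algorithm.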
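
The perturbation in Equation (8) is a single inner product between the
perturbation matrix and each class matrix, which is why it adds essentially no
cost. The sketch below computes that term and adds it to an already evaluated
unperturbed objective; the matrices, the value 1.2, and the function names are
hypothetical.

```python
def perturbation_term(phis, b):
    """Sum over classes c and entries (i, j) of b[i][j] * Phi_c[i][j]."""
    size = len(b)
    return sum(b[i][j] * phi[i][j]
               for phi in phis for i in range(size) for j in range(size))

def perturbed_objective(unperturbed_value, phis, b):
    """J_p = J + perturbation term, given a precomputed value of J."""
    return unperturbed_value + perturbation_term(phis, b)

# Hypothetical 2 x 2 class matrices and perturbation matrix.
phis = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 1.0], [1.0, 0.0]]]
b = [[0.5, -1.0], [2.0, 0.0]]
print(perturbation_term(phis, b), perturbed_objective(1.2, phis, b))
```

Because the term is linear in the class matrices, adding it leaves the
convexity of the objective unchanged.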

5 Theoretical Analysis 

5.1 Proof of Differential Privacy

In the following theorem, we prove that the classifier minimizing the perturbed
optimization function $J_p(\boldsymbol{\Phi}, \mathbf{x}, \mathbf{y})$
satisfies $\epsilon$-differential privacy. Given the dataset
$(\mathbf{x}, \mathbf{y}) = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_{n-1}, y_{n-1}), (\mathbf{x}_n, y_n)\}$,
the probability of learning the classifier $\boldsymbol{\Phi}^p$ is close to
the probability of learning the same classifier $\boldsymbol{\Phi}^p$ given
its adjacent dataset
$(\mathbf{x}', \mathbf{y}') = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_{n-1}, y_{n-1}), (\mathbf{x}'_n, y'_n)\}$,
differing without loss of generality on the $n$-th instance. As we mentioned in
the previous section, it is always possible to find such a classifier
$\boldsymbol{\Phi}^p$ minimizing both
$J_p(\boldsymbol{\Phi}, \mathbf{x}, \mathbf{y})$ and
$J_p(\boldsymbol{\Phi}, \mathbf{x}', \mathbf{y}')$ due to the perturbation
matrix being in the same space as the optimization parameters.

Our proof requires a strictly convex perturbed objective function resulting in
a unique solution $\boldsymbol{\Phi}^p$ minimizing it. This in turn requires
that the loss function $L(\boldsymbol{\Phi}, \mathbf{x}, \mathbf{y})$ is
strictly convex and differentiable, and that the regularization term
$N(\boldsymbol{\Phi})$ is convex. These seemingly strong constraints are
satisfied by many commonly used classification algorithms, such as logistic
regression and support vector machines, and our general perturbation technique
can be extended to those algorithms. In our proposed algorithm, the Huber loss
is by definition a differentiable function and the trace regularization term is
convex and differentiable. Additionally, we require that the difference in the
gradients of $L(\boldsymbol{\Phi}, \mathbf{x}, \mathbf{y})$ calculated over two
adjacent training datasets is bounded. We prove this property in Lemma 1 given
in the appendix.

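Theorem 1. For any two adjacent training datasets $(\mathbf{x}, \mathbf{y})$
and $(\mathbf{x}', \mathbf{y}')$, the classifier $\boldsymbol{\Phi}^p$
minimizing the perturbed objective function
$J_p(\boldsymbol{\Phi}, \mathbf{x}, \mathbf{y})$ satisfies differential
privacy.

$$\left| \log \frac{P(\boldsymbol{\Phi}^p \mid \mathbf{x}, \mathbf{y})}{P(\boldsymbol{\Phi}^p \mid \mathbf{x}', \mathbf{y}')} \right| \le \epsilon',$$

where $\epsilon' = \epsilon + k$ for a constant factor
$k = \log\left( 1 + \frac{2\alpha}{n\lambda} + \frac{\alpha^2}{n^2\lambda^2} \right)$
with a constant value of $\alpha$.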

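Proof. As $J(\boldsymbol{\Phi}, \mathbf{x}, \mathbf{y})$ is convex and
differentiable, there is a unique solution $\boldsymbol{\Phi}^*$ that minimizes
it. As the perturbation term $\sum_c \sum_{ij} b_{ij} \Phi_{cij}$ is also
convex and differentiable, the perturbed objective function
$J_p(\boldsymbol{\Phi}, \mathbf{x}, \mathbf{y})$ also has a unique solution
$\boldsymbol{\Phi}^p$ that minimizes it. Differentiating
$J_p(\boldsymbol{\Phi}, \mathbf{x}, \mathbf{y})$ with respect to
$\boldsymbol{\Phi}_c$, we have

$$\frac{\partial}{\partial \boldsymbol{\Phi}_c} J_p(\boldsymbol{\Phi}, \mathbf{x}, \mathbf{y}) = \frac{\partial}{\partial \boldsymbol{\Phi}_c} L(\boldsymbol{\Phi}, \mathbf{x}, \mathbf{y}) + \lambda \mathbf{I}_\Phi + \mathbf{b}.$$

Substituting the optimal $\boldsymbol{\Phi}^p$ in the derivative and setting it
to zero gives us

$$\lambda \mathbf{I}_\Phi + \mathbf{b} = -\frac{\partial}{\partial \boldsymbol{\Phi}_c} L(\boldsymbol{\Phi}^p, \mathbf{x}, \mathbf{y}). \tag{9}$$

This relation shows that two different values of $\mathbf{b}$ cannot result in
the same optimal $\boldsymbol{\Phi}^p$. As the perturbed objective function
$J_p(\boldsymbol{\Phi}, \mathbf{x}, \mathbf{y})$ is also convex and
differentiable, there is a bijective map between the perturbation $\mathbf{b}$
and the unique $\boldsymbol{\Phi}^p$ minimizing
$J_p(\boldsymbol{\Phi}, \mathbf{x}, \mathbf{y})$.

Let $\mathbf{b}_1$ and $\mathbf{b}_2$ be the two perturbations applied when
training with the adjacent datasets $(\mathbf{x}, \mathbf{y})$ and
$(\mathbf{x}', \mathbf{y}')$, respectively. Assuming that we obtain the same
optimal solution $\boldsymbol{\Phi}^p$ while minimizing both
$J_p(\boldsymbol{\Phi}, \mathbf{x}, \mathbf{y})$ with perturbation
$\mathbf{b}_1$ and $J_p(\boldsymbol{\Phi}, \mathbf{x}', \mathbf{y}')$ with
perturbation $\mathbf{b}_2$,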



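$$\lambda \mathbf{I}_\Phi + \mathbf{b}_1 = -\frac{\partial}{\partial \boldsymbol{\Phi}_c} L(\boldsymbol{\Phi}^p, \mathbf{x}, \mathbf{y}), \qquad \lambda \mathbf{I}_\Phi + \mathbf{b}_2 = -\frac{\partial}{\partial \boldsymbol{\Phi}_c} L(\boldsymbol{\Phi}^p, \mathbf{x}', \mathbf{y}'),$$

$$\mathbf{b}_1 - \mathbf{b}_2 = \frac{\partial}{\partial \boldsymbol{\Phi}_c} L(\boldsymbol{\Phi}^p, \mathbf{x}', \mathbf{y}') - \frac{\partial}{\partial \boldsymbol{\Phi}_c} L(\boldsymbol{\Phi}^p, \mathbf{x}, \mathbf{y}). \tag{10}$$

We apply Lemma 1 after taking the Frobenius norm on both sides. As the two
datasets differ only on the $n$-th instance, the terms for all other instances
cancel.

$$\|\mathbf{b}_1 - \mathbf{b}_2\| = \left\| \frac{\partial}{\partial \boldsymbol{\Phi}_c} L(\boldsymbol{\Phi}^p, \mathbf{x}', \mathbf{y}') - \frac{\partial}{\partial \boldsymbol{\Phi}_c} L(\boldsymbol{\Phi}^p, \mathbf{x}, \mathbf{y}) \right\| = \left\| \frac{\partial}{\partial \boldsymbol{\Phi}_c} L(\boldsymbol{\Phi}^p, \mathbf{x}'_n, y'_n) - \frac{\partial}{\partial \boldsymbol{\Phi}_c} L(\boldsymbol{\Phi}^p, \mathbf{x}_n, y_n) \right\| \le 2.$$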



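Using this property, we can calculate the ratio of densities of drawing the
perturbation matrices $\mathbf{b}_1$ and $\mathbf{b}_2$ as

$$\frac{P(\mathbf{b} = \mathbf{b}_1)}{P(\mathbf{b} = \mathbf{b}_2)} = \frac{\frac{1}{\mathrm{surf}(\|\mathbf{b}_1\|)} \|\mathbf{b}_1\|^d \exp\left[ -\frac{\epsilon}{2} \|\mathbf{b}_1\| \right]}{\frac{1}{\mathrm{surf}(\|\mathbf{b}_2\|)} \|\mathbf{b}_2\|^d \exp\left[ -\frac{\epsilon}{2} \|\mathbf{b}_2\| \right]},$$

where $\mathrm{surf}(\|\mathbf{b}\|)$ is the surface area of the
$(d+1)$-dimensional hypersphere with radius $\|\mathbf{b}\|$. As
$\mathrm{surf}(\|\mathbf{b}\|) = \mathrm{surf}(1) \|\mathbf{b}\|^d$, where
$\mathrm{surf}(1)$ is the area of the unit $(d+1)$-dimensional hypersphere, the
ratio of the densities becomes

$$\frac{P(\mathbf{b} = \mathbf{b}_1)}{P(\mathbf{b} = \mathbf{b}_2)} = \exp\left[ \frac{\epsilon}{2} \left( \|\mathbf{b}_2\| - \|\mathbf{b}_1\| \right) \right] \le \exp\left[ \frac{\epsilon}{2} \|\mathbf{b}_2 - \mathbf{b}_1\| \right] \le \exp(\epsilon). \tag{11}$$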
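
The chain of inequalities in Equation (11) rests on the reverse triangle
inequality together with the bound of 2 on the distance between the two
perturbations. The sketch below checks it numerically on random flattened
matrices; the dimension, the seed, and the function names are assumptions for
illustration.

```python
import math
import random

def frobenius_norm(flat):
    """Frobenius norm of a matrix stored as a flat list of entries."""
    return math.sqrt(sum(v * v for v in flat))

def log_density_ratio(b1, b2, epsilon):
    """log of exp(-eps/2 ||b1||) / exp(-eps/2 ||b2||)."""
    return 0.5 * epsilon * (frobenius_norm(b2) - frobenius_norm(b1))

rng = random.Random(7)
epsilon = 0.8
worst = 0.0
for _ in range(1000):
    b1 = [rng.uniform(-5.0, 5.0) for _ in range(9)]
    shift = [rng.gauss(0.0, 1.0) for _ in range(9)]
    # Rescale the shift so that ||b2 - b1|| is at most 2.
    scale = 2.0 * rng.random() / frobenius_norm(shift)
    b2 = [u + scale * s for u, s in zip(b1, shift)]
    worst = max(worst, abs(log_density_ratio(b1, b2, epsilon)))
print(worst <= epsilon + 1e-9)
```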



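The ratio of the densities of learning $\boldsymbol{\Phi}^p$ using the adjacent
datasets $(\mathbf{x}, \mathbf{y})$ and $(\mathbf{x}', \mathbf{y}')$ is given
by

$$\frac{P(\boldsymbol{\Phi}^p \mid \mathbf{x}, \mathbf{y})}{P(\boldsymbol{\Phi}^p \mid \mathbf{x}', \mathbf{y}')} = \frac{P(\mathbf{b} = \mathbf{b}_1)}{P(\mathbf{b} = \mathbf{b}_2)} \cdot \frac{\left| \det\left( \mathbf{J}(\boldsymbol{\Phi}^p \to \mathbf{b}_1 \mid \mathbf{x}, \mathbf{y}) \right) \right|^{-1}}{\left| \det\left( \mathbf{J}(\boldsymbol{\Phi}^p \to \mathbf{b}_2 \mid \mathbf{x}', \mathbf{y}') \right) \right|^{-1}}, \tag{12}$$

where $\mathbf{J}(\boldsymbol{\Phi}^p \to \mathbf{b}_1 \mid \mathbf{x}, \mathbf{y})$
and $\mathbf{J}(\boldsymbol{\Phi}^p \to \mathbf{b}_2 \mid \mathbf{x}', \mathbf{y}')$
are the Jacobian matrices of the bijective mappings from $\boldsymbol{\Phi}^p$
to $\mathbf{b}_1$ and $\mathbf{b}_2$, respectively. Following a procedure
identical to Theorem 2 of [16] (omitted due to lack of space), it can be shown
that the ratio of Jacobian determinants is upper bounded by a constant factor
$\exp(k) = 1 + \frac{2\alpha}{n\lambda} + \frac{\alpha^2}{n^2\lambda^2}$ for a
constant value of $\alpha$. Therefore, the ratio of the densities of learning
$\boldsymbol{\Phi}^p$ using the adjacent datasets becomes

$$\frac{P(\boldsymbol{\Phi}^p \mid \mathbf{x}, \mathbf{y})}{P(\boldsymbol{\Phi}^p \mid \mathbf{x}', \mathbf{y}')} \le \exp(\epsilon + k) = \exp(\epsilon'). \tag{13}$$

Similarly, we can show that the probability ratio is lower bounded by
$\exp(-\epsilon')$, which together with Equation (13) satisfies the definition
of differential privacy.

□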



5.2 Analysis of Excess Error 

In the remainder of this section, we denote the terms
$J(\boldsymbol{\Phi}, \mathbf{x}, \mathbf{y})$ and
$L(\boldsymbol{\Phi}, \mathbf{x}, \mathbf{y})$ by $J(\boldsymbol{\Phi})$ and
$L(\boldsymbol{\Phi})$ respectively for conciseness. To establish a bound on
the excess risk of the classifier given by the proposed algorithm minimizing
the perturbed objective function, we show in Lemma 2 that the objective
function $J(\boldsymbol{\Phi})$ satisfies strong convexity. The objective
function $J(\boldsymbol{\Phi})$ contains the loss function
$L(\boldsymbol{\Phi})$ computed over the training data
$(\mathbf{x}, \mathbf{y})$ and the regularization term $N(\boldsymbol{\Phi})$;
this is known as the regularized empirical risk of the classifier
$\boldsymbol{\Phi}$. In the following theorem, we establish a bound on the
regularized empirical excess risk of the differentially private classifier
minimizing the perturbed objective function over the classifier minimizing the
unperturbed objective function.

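Theorem 2. With probability at least $1 - \delta$, the regularized empirical
excess risk of the classifier $\boldsymbol{\Phi}^p$ minimizing the perturbed
objective function $J_p(\boldsymbol{\Phi})$ over the classifier
$\boldsymbol{\Phi}^*$ minimizing the unperturbed objective function
$J(\boldsymbol{\Phi})$ is bounded as

$$J(\boldsymbol{\Phi}^p) \le J(\boldsymbol{\Phi}^*) + \frac{8 (d+1)^4 C}{\epsilon^2 \lambda} \log^2\left( \frac{d}{\delta} \right).$$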

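Proof. We use the definition of
$J_p(\boldsymbol{\Phi}) = J(\boldsymbol{\Phi}) + \sum_c \sum_{ij} b_{ij} \Phi_{cij}$
and the optimality of $\boldsymbol{\Phi}^p$, i.e.,
$J_p(\boldsymbol{\Phi}^p) \le J_p(\boldsymbol{\Phi}^*)$.

$$J(\boldsymbol{\Phi}^p) + \sum_c \sum_{ij} b_{ij} \Phi^p_{cij} \le J(\boldsymbol{\Phi}^*) + \sum_c \sum_{ij} b_{ij} \Phi^*_{cij},$$

$$J(\boldsymbol{\Phi}^p) \le J(\boldsymbol{\Phi}^*) + \sum_c \sum_{ij} b_{ij} (\Phi^*_{cij} - \Phi^p_{cij}). \tag{14}$$

Using the strong convexity of $J(\boldsymbol{\Phi})$ as given by Lemma 2 and
the optimality of $J(\boldsymbol{\Phi}^*)$, we have

$$J(\boldsymbol{\Phi}^*) \le J\left( \frac{\boldsymbol{\Phi}^p + \boldsymbol{\Phi}^*}{2} \right) \le \frac{J(\boldsymbol{\Phi}^p) + J(\boldsymbol{\Phi}^*)}{2} - \frac{\lambda}{8} \sum_c \|\boldsymbol{\Phi}^*_c - \boldsymbol{\Phi}^p_c\|^2,$$

$$J(\boldsymbol{\Phi}^p) - J(\boldsymbol{\Phi}^*) \ge \frac{\lambda}{4} \sum_c \|\boldsymbol{\Phi}^*_c - \boldsymbol{\Phi}^p_c\|^2. \tag{15}$$

Similarly, using the strong convexity of $J_p(\boldsymbol{\Phi})$ and the
optimality of $J_p(\boldsymbol{\Phi}^p)$,

$$J_p(\boldsymbol{\Phi}^p) \le J_p\left( \frac{\boldsymbol{\Phi}^p + \boldsymbol{\Phi}^*}{2} \right) \le \frac{J_p(\boldsymbol{\Phi}^p) + J_p(\boldsymbol{\Phi}^*)}{2} - \frac{\lambda}{8} \sum_c \|\boldsymbol{\Phi}^*_c - \boldsymbol{\Phi}^p_c\|^2.$$

Substituting the definition
$J_p(\boldsymbol{\Phi}) = J(\boldsymbol{\Phi}) + \sum_c \sum_{ij} b_{ij} \Phi_{cij}$,

$$J(\boldsymbol{\Phi}^*) + \sum_c \sum_{ij} b_{ij} \Phi^*_{cij} - J(\boldsymbol{\Phi}^p) - \sum_c \sum_{ij} b_{ij} \Phi^p_{cij} \ge \frac{\lambda}{4} \sum_c \|\boldsymbol{\Phi}^*_c - \boldsymbol{\Phi}^p_c\|^2,$$

$$\sum_c \sum_{ij} b_{ij} (\Phi^*_{cij} - \Phi^p_{cij}) - \left( J(\boldsymbol{\Phi}^p) - J(\boldsymbol{\Phi}^*) \right) \ge \frac{\lambda}{4} \sum_c \|\boldsymbol{\Phi}^*_c - \boldsymbol{\Phi}^p_c\|^2.$$

Substituting the lower bound on
$J(\boldsymbol{\Phi}^p) - J(\boldsymbol{\Phi}^*)$ given by Equation (15),

$$\sum_c \sum_{ij} b_{ij} (\Phi^*_{cij} - \Phi^p_{cij}) \ge \frac{\lambda}{2} \sum_c \|\boldsymbol{\Phi}^*_c - \boldsymbol{\Phi}^p_c\|^2,$$

$$\left[ \sum_c \sum_{ij} b_{ij} (\Phi^*_{cij} - \Phi^p_{cij}) \right]^2 \ge \frac{\lambda^2}{4} \left[ \sum_c \|\boldsymbol{\Phi}^*_c - \boldsymbol{\Phi}^p_c\|^2 \right]^2. \tag{16}$$

Using the Cauchy-Schwarz inequality, we have,

$$\left[ \sum_c \sum_{ij} b_{ij} (\Phi^*_{cij} - \Phi^p_{cij}) \right]^2 \le C \|\mathbf{b}\|^2 \sum_c \|\boldsymbol{\Phi}^*_c - \boldsymbol{\Phi}^p_c\|^2. \tag{17}$$

Combining this with Equation (16) gives us

$$\sum_c \|\boldsymbol{\Phi}^*_c - \boldsymbol{\Phi}^p_c\|^2 \le \frac{4C}{\lambda^2} \|\mathbf{b}\|^2. \tag{18}$$

Combining this with Equation (17) gives us

$$\sum_c \sum_{ij} b_{ij} (\Phi^*_{cij} - \Phi^p_{cij}) \le \frac{2C}{\lambda} \|\mathbf{b}\|^2.$$

We bound $\|\mathbf{b}\|^2$ with probability at least $1 - \delta$ as given by
Lemma 4.

$$\sum_c \sum_{ij} b_{ij} (\Phi^*_{cij} - \Phi^p_{cij}) \le \frac{8 (d+1)^4 C}{\epsilon^2 \lambda} \log^2\left( \frac{d}{\delta} \right). \tag{19}$$

Substituting this in Equation (14) proves the theorem.

□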



The upper bound on the regularized empirical excess risk is in
$O(1/\epsilon^2)$. The bound increases for smaller values of $\epsilon$, which
imply tighter privacy, and therefore suggests a trade-off between privacy and
utility.
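
To make the privacy-utility trade-off concrete, the sketch below evaluates an
excess risk bound of the form
$\frac{8 (d+1)^4 C}{\epsilon^2 \lambda} \log^2(d/\delta)$ for a few values of
$\epsilon$. This closed form is our reading of the bound in Theorem 2 and
should be treated as an assumption, as should the parameter values.

```python
import math

def empirical_excess_risk_bound(d, num_classes, epsilon, lam, delta):
    """Assumed form: 8 (d+1)^4 C / (eps^2 lambda) * log^2(d / delta)."""
    return (8.0 * (d + 1) ** 4 * num_classes / (epsilon ** 2 * lam)
            * math.log(d / delta) ** 2)

for eps in (1.0, 0.5, 0.25):
    bound = empirical_excess_risk_bound(d=10, num_classes=3, epsilon=eps,
                                        lam=0.1, delta=0.05)
    print(eps, bound)
```

Halving $\epsilon$ multiplies the bound by four, which is the quadratic
dependence on $1/\epsilon$.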

The regularized empirical risk of a classifier is calculated over a given
training dataset. In practice, we are more interested in how the classifier
will perform on new test data, which is assumed to be generated from the same
source as the training data. The expected value of the loss function computed
over the data is called the true risk
$\tilde{L}(\boldsymbol{\Phi}) = E[L(\boldsymbol{\Phi})]$ of the classifier
$\boldsymbol{\Phi}$. In the following theorem, we establish a bound on the true
excess risk of the differentially private classifier minimizing the perturbed
objective function over the classifier minimizing the original objective
function.

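Theorem 3. With probability at least $1 - \delta$, the true excess risk of the
classifier $\boldsymbol{\Phi}^p$ minimizing the perturbed objective function
$J_p(\boldsymbol{\Phi})$ over the classifier $\boldsymbol{\Phi}^*$ minimizing
the unperturbed objective function $J(\boldsymbol{\Phi})$ is bounded as

$$\tilde{L}(\boldsymbol{\Phi}^p) \le \tilde{L}(\boldsymbol{\Phi}^*) + \frac{4 \sqrt{d} (d+1)^2 C}{\epsilon} \log\left( \frac{d}{\delta} \right) + \frac{16 (d+1)^4 C}{\epsilon^2 \lambda} \log^2\left( \frac{d}{\delta} \right) + \frac{16}{\lambda n} \left[ 32 + \log\left( \frac{1}{\delta} \right) \right].$$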

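Proof. Let the expected value of the regularized empirical risk be

$$\tilde{J}(\boldsymbol{\Phi}) = \tilde{L}(\boldsymbol{\Phi}) + \lambda \sum_c \mathrm{trace}(\mathbf{I}_\Phi \boldsymbol{\Phi}_c \mathbf{I}_\Phi). \tag{20}$$

Let $\boldsymbol{\Phi}^r$ be the classifier minimizing
$\tilde{J}(\boldsymbol{\Phi})$, i.e.,
$\tilde{J}(\boldsymbol{\Phi}^r) \le \tilde{J}(\boldsymbol{\Phi}^*)$.
Rearranging the terms, we have

$$\tilde{J}(\boldsymbol{\Phi}^p) = \tilde{J}(\boldsymbol{\Phi}^*) + [\tilde{J}(\boldsymbol{\Phi}^p) - \tilde{J}(\boldsymbol{\Phi}^r)] + [\tilde{J}(\boldsymbol{\Phi}^r) - \tilde{J}(\boldsymbol{\Phi}^*)] \le \tilde{J}(\boldsymbol{\Phi}^*) + [\tilde{J}(\boldsymbol{\Phi}^p) - \tilde{J}(\boldsymbol{\Phi}^r)].$$

Substituting the definition of $\tilde{J}(\boldsymbol{\Phi})$,

$$\tilde{L}(\boldsymbol{\Phi}^p) + \lambda \sum_c \mathrm{trace}(\mathbf{I}_\Phi \boldsymbol{\Phi}^p_c \mathbf{I}_\Phi) \le \tilde{L}(\boldsymbol{\Phi}^*) + \lambda \sum_c \mathrm{trace}(\mathbf{I}_\Phi \boldsymbol{\Phi}^*_c \mathbf{I}_\Phi) + [\tilde{J}(\boldsymbol{\Phi}^p) - \tilde{J}(\boldsymbol{\Phi}^r)],$$

$$\tilde{L}(\boldsymbol{\Phi}^p) \le \tilde{L}(\boldsymbol{\Phi}^*) + \lambda \sum_c \mathrm{trace}[\mathbf{I}_\Phi (\boldsymbol{\Phi}^*_c - \boldsymbol{\Phi}^p_c) \mathbf{I}_\Phi] + [\tilde{J}(\boldsymbol{\Phi}^p) - \tilde{J}(\boldsymbol{\Phi}^r)]. \tag{21}$$

From Lemma 3 and Equation (18), we have,

$$\left[ \sum_c \mathrm{trace}[\mathbf{I}_\Phi (\boldsymbol{\Phi}^*_c - \boldsymbol{\Phi}^p_c) \mathbf{I}_\Phi] \right]^2 \le dC \sum_c \|\boldsymbol{\Phi}^*_c - \boldsymbol{\Phi}^p_c\|^2 \le \frac{4 d C^2}{\lambda^2} \|\mathbf{b}\|^2 \le \frac{16 d (d+1)^4 C^2}{\epsilon^2 \lambda^2} \log^2\left( \frac{d}{\delta} \right).$$

Taking the square root,

$$\sum_c \mathrm{trace}[\mathbf{I}_\Phi (\boldsymbol{\Phi}^*_c - \boldsymbol{\Phi}^p_c) \mathbf{I}_\Phi] \le \frac{4 \sqrt{d} (d+1)^2 C}{\epsilon \lambda} \log\left( \frac{d}{\delta} \right). \tag{22}$$

Sridharan, et al. [17] present a bound on the true excess risk of any
classifier as an expression of the bound on the regularized empirical excess
risk for that classifier. With probability at least $1 - \delta$,

$$\tilde{J}(\boldsymbol{\Phi}^p) - \tilde{J}(\boldsymbol{\Phi}^r) \le 2 [J(\boldsymbol{\Phi}^p) - J(\boldsymbol{\Phi}^*)] + \frac{16}{\lambda n} \left[ 32 + \log\left( \frac{1}{\delta} \right) \right].$$

Substituting the bound from Theorem 2,

$$\tilde{J}(\boldsymbol{\Phi}^p) - \tilde{J}(\boldsymbol{\Phi}^r) \le \frac{16 (d+1)^4 C}{\epsilon^2 \lambda} \log^2\left( \frac{d}{\delta} \right) + \frac{16}{\lambda n} \left[ 32 + \log\left( \frac{1}{\delta} \right) \right]. \tag{23}$$

Substituting the results from Equations (22) and (23) into Equation (21) proves
the theorem.

□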

Similar to the bound on the regularized empirical excess risk, the bound on the
true excess risk is also inversely proportional to $\epsilon$, reflecting the
privacy-utility trade-off. The bound is linear in the number of classes $C$,
which is a consequence of the multi-class classification. The classifier
learned using a higher value of the regularization parameter $\lambda$ will
have a higher covariance for each class ellipsoid. This would also make the
classifier less sensitive to the perturbation. This intuition is confirmed by
the fact that the true excess risk bound is inversely proportional to
$\lambda$.

6 Conclusion 

In this paper, we present a discriminatively trained Gaussian classification
algorithm that satisfies differential privacy. Our proposed technique involves
adding a perturbation term to the objective function. We prove that the
proposed algorithm satisfies differential privacy and establish a bound on the
excess risk of the classifier learned by the algorithm. The bound grows with
the data dimensionality, is directly proportional to the number of classes, and
is inversely proportional to the privacy parameter $\epsilon$, reflecting a
trade-off between privacy and utility.

In the future, we plan to extend this work along two main directions: extend- 
ing our perturbation technique for a general class of learning algorithms and 
applying results from theory of large margin classifiers to arrive at tighter excess 
risk bounds for the differentially private large margin classifiers. Our intuition is 
that compared to other classification algorithms, a large margin classifier should 
be much more robust to perturbation. This would also give us insights into 
designing low error inducing mechanisms for differentially private classifiers.

Appendix

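Lemma 1. Assuming all the data instances lie within a unit $\ell_2$ ball, the
difference in the derivative of the Huber loss function
$L(\boldsymbol{\Phi}, \mathbf{x}, \mathbf{y})$ calculated over two data
instances $(\mathbf{x}_i, y_i)$ and $(\mathbf{x}'_i, y'_i)$ is bounded.

$$\left\| \frac{\partial}{\partial \boldsymbol{\Phi}_c} L(\boldsymbol{\Phi}, \mathbf{x}_i, y_i) - \frac{\partial}{\partial \boldsymbol{\Phi}_c} L(\boldsymbol{\Phi}, \mathbf{x}'_i, y'_i) \right\| \le 2.$$

Proof. The derivative of the Huber loss function for the data instance
$\mathbf{x}_i$ with label $y_i$ is

$$\frac{\partial}{\partial \boldsymbol{\Phi}_c} L(\boldsymbol{\Phi}, \mathbf{x}_i, y_i) = \begin{cases} 0 & \text{if } \mathbf{x}_i^T (\boldsymbol{\Phi}_c - \boldsymbol{\Phi}_{y_i}) \mathbf{x}_i > h, \\ \frac{1}{2h} \left[ h - \mathbf{x}_i^T (\boldsymbol{\Phi}_{y_i} - \boldsymbol{\Phi}_c) \mathbf{x}_i \right] \mathbf{x}_i \mathbf{x}_i^T & \text{if } |\mathbf{x}_i^T (\boldsymbol{\Phi}_c - \boldsymbol{\Phi}_{y_i}) \mathbf{x}_i| \le h, \\ \mathbf{x}_i \mathbf{x}_i^T & \text{if } \mathbf{x}_i^T (\boldsymbol{\Phi}_c - \boldsymbol{\Phi}_{y_i}) \mathbf{x}_i < -h. \end{cases}$$

The data points lie in an $\ell_2$ ball of radius 1,
$\forall i : \|\mathbf{x}_i\|_2 \le 1$. Using linear algebra, it is easy to
show that the Frobenius norm of the matrix $\mathbf{x}_i \mathbf{x}_i^T$ equals
the squared $\ell_2$ norm of the vector $\mathbf{x}_i$,
$\|\mathbf{x}_i \mathbf{x}_i^T\| = \|\mathbf{x}_i\|_2^2 \le 1$.

As the term
$\frac{1}{2h} [h - \mathbf{x}_i^T (\boldsymbol{\Phi}_{y_i} - \boldsymbol{\Phi}_c) \mathbf{x}_i]$
is at most one when
$|\mathbf{x}_i^T (\boldsymbol{\Phi}_c - \boldsymbol{\Phi}_{y_i}) \mathbf{x}_i| \le h$,
the Frobenius norm of the derivative of the Huber loss function is at most one
in all cases,
$\left\| \frac{\partial}{\partial \boldsymbol{\Phi}_c} L(\boldsymbol{\Phi}, \mathbf{x}_i, y_i) \right\| \le 1$.
Using a similar argument for the data instance $\mathbf{x}'_i$ with label
$y'_i$, we have
$\left\| \frac{\partial}{\partial \boldsymbol{\Phi}_c} L(\boldsymbol{\Phi}, \mathbf{x}'_i, y'_i) \right\| \le 1$.

Finally, using the triangle inequality
$\|a - b\| = \|a + (-b)\| \le \|a\| + \|b\|$,

$$\left\| \frac{\partial}{\partial \boldsymbol{\Phi}_c} L(\boldsymbol{\Phi}, \mathbf{x}_i, y_i) - \frac{\partial}{\partial \boldsymbol{\Phi}_c} L(\boldsymbol{\Phi}, \mathbf{x}'_i, y'_i) \right\| \le \left\| \frac{\partial}{\partial \boldsymbol{\Phi}_c} L(\boldsymbol{\Phi}, \mathbf{x}_i, y_i) \right\| + \left\| \frac{\partial}{\partial \boldsymbol{\Phi}_c} L(\boldsymbol{\Phi}, \mathbf{x}'_i, y'_i) \right\| \le 2.$$

□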
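
The norm fact used in the proof of Lemma 1 can be checked numerically: for a
vector inside the unit ball, the Frobenius norm of its outer product with
itself equals its squared $\ell_2$ norm and is therefore at most one. The
random vector and the helper names below are assumptions for illustration.

```python
import math
import random

def outer(x):
    """Outer product x x^T as nested lists."""
    return [[a * b for b in x] for a in x]

def frobenius(mat):
    """Frobenius norm of a matrix stored as nested lists."""
    return math.sqrt(sum(v * v for row in mat for v in row))

rng = random.Random(1)
x = [rng.uniform(-1.0, 1.0) for _ in range(4)]
norm = math.sqrt(sum(v * v for v in x))
x = [v / max(1.0, norm) for v in x]   # project into the unit l2 ball
squared_norm = sum(v * v for v in x)
print(abs(frobenius(outer(x)) - squared_norm) < 1e-12, squared_norm <= 1.0 + 1e-12)
```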
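Lemma 2. The objective function $J(\boldsymbol{\Phi})$ is $\lambda$-strongly
convex. For $0 \le \alpha \le 1$,

$$J(\alpha \boldsymbol{\Phi} + (1 - \alpha) \boldsymbol{\Phi}') \le \alpha J(\boldsymbol{\Phi}) + (1 - \alpha) J(\boldsymbol{\Phi}') - \frac{\lambda \alpha (1 - \alpha)}{2} \sum_c \|\boldsymbol{\Phi}_c - \boldsymbol{\Phi}'_c\|^2.$$

Proof. By definition, Huber loss is $\lambda$-strongly convex, i.e.

$$L(\alpha \boldsymbol{\Phi} + (1 - \alpha) \boldsymbol{\Phi}') \le \alpha L(\boldsymbol{\Phi}) + (1 - \alpha) L(\boldsymbol{\Phi}') - \frac{\lambda \alpha (1 - \alpha)}{2} \|\boldsymbol{\Phi} - \boldsymbol{\Phi}'\|^2, \tag{24}$$

where the squared Frobenius norm of the matrix set
$\boldsymbol{\Phi} - \boldsymbol{\Phi}'$ is the sum of the squared norms of the
component matrices $\boldsymbol{\Phi}_c - \boldsymbol{\Phi}'_c$,

$$\|\boldsymbol{\Phi} - \boldsymbol{\Phi}'\|^2 = \sum_c \|\boldsymbol{\Phi}_c - \boldsymbol{\Phi}'_c\|^2. \tag{25}$$

As the regularization term $N(\boldsymbol{\Phi})$ is linear,

$$N(\alpha \boldsymbol{\Phi} + (1 - \alpha) \boldsymbol{\Phi}') = \lambda \sum_c \mathrm{trace}(\alpha \mathbf{I}_\Phi \boldsymbol{\Phi}_c \mathbf{I}_\Phi + (1 - \alpha) \mathbf{I}_\Phi \boldsymbol{\Phi}'_c \mathbf{I}_\Phi) \tag{26}$$

$$= \alpha \lambda \sum_c \mathrm{trace}(\mathbf{I}_\Phi \boldsymbol{\Phi}_c \mathbf{I}_\Phi) + (1 - \alpha) \lambda \sum_c \mathrm{trace}(\mathbf{I}_\Phi \boldsymbol{\Phi}'_c \mathbf{I}_\Phi) = \alpha N(\boldsymbol{\Phi}) + (1 - \alpha) N(\boldsymbol{\Phi}').$$

The lemma follows directly from the definition
$J(\boldsymbol{\Phi}) = L(\boldsymbol{\Phi}) + N(\boldsymbol{\Phi})$.

□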



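Lemma 3.

$$\frac{1}{dC} \left[ \sum_c \mathrm{trace}[\mathbf{I}_\Phi (\boldsymbol{\Phi}_c - \boldsymbol{\Phi}'_c) \mathbf{I}_\Phi] \right]^2 \le \sum_c \|\boldsymbol{\Phi}_c - \boldsymbol{\Phi}'_c\|^2.$$

Proof. Let $\Phi_{c,i,j}$ be the $(i,j)$-th element of the
$(d+1) \times (d+1)$ matrix $\boldsymbol{\Phi}_c - \boldsymbol{\Phi}'_c$. By
the definition of the Frobenius norm, and using the identity
$N \sum_{i=1}^N x_i^2 \ge \left( \sum_{i=1}^N x_i \right)^2$,

$$\sum_c \|\boldsymbol{\Phi}_c - \boldsymbol{\Phi}'_c\|^2 = \sum_c \sum_{i=1}^{d+1} \sum_{j=1}^{d+1} \Phi_{c,i,j}^2 \ge \sum_c \sum_{i=1}^{d} \Phi_{c,i,i}^2 \ge \frac{1}{dC} \left( \sum_c \sum_{i=1}^{d} \Phi_{c,i,i} \right)^2 = \frac{1}{dC} \left[ \sum_c \mathrm{trace}[\mathbf{I}_\Phi (\boldsymbol{\Phi}_c - \boldsymbol{\Phi}'_c) \mathbf{I}_\Phi] \right]^2.$$

□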
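
The inequality of Lemma 3 can be sanity-checked on random matrices. In the
sketch below each matrix stands for one class difference, the truncated trace
sums the first $d$ diagonal entries, and the sizes, seed, and function names
are assumptions for illustration.

```python
import random

def truncated_trace(mat):
    """trace(I_Phi M I_Phi): sum of the first d diagonal entries of M."""
    return sum(mat[i][i] for i in range(len(mat) - 1))

def squared_frobenius(mat):
    """Sum of squared entries of M."""
    return sum(v * v for row in mat for v in row)

rng = random.Random(3)
d, num_classes = 3, 4
# One random (d+1) x (d+1) difference matrix per class.
diffs = [[[rng.uniform(-1.0, 1.0) for _ in range(d + 1)] for _ in range(d + 1)]
         for _ in range(num_classes)]
lhs = sum(truncated_trace(m) for m in diffs) ** 2 / (d * num_classes)
rhs = sum(squared_frobenius(m) for m in diffs)
print(lhs <= rhs)
```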



Lemma 4.

$$P\left[ \|\mathbf{b}\| \ge \frac{2 (d+1)^2}{\epsilon} \log\left( \frac{d}{\delta} \right) \right] \le \delta.$$

Proof. Similar to the union bound argument used in Lemma 5 in [2].



